One of the key activities of any IT function is to ensure business operations are not impacted. IT uses the Incident Management process to achieve this objective. An incident is an unplanned interruption to an IT service, or a reduction in the quality of an IT service, that affects users and the business. The main goal of the Incident Management process is to provide a quick fix, workaround, or solution that resolves the interruption and restores the service to full capacity, so that there is no business impact.
In most organizations, incidents are created by business and IT users, by end users or vendors (if they have access to the ticketing system), and by integrated monitoring systems and tools. Assigning incidents to the appropriate person or unit in the support team is critical for improving user satisfaction while ensuring better allocation of support resources.
Manual assignment of incidents is time consuming and labor intensive. Human error leads to misrouted tickets and ineffective use of support resources. Manual assignment also increases response and resolution times, which degrades user satisfaction and results in poor customer service.
In the support process, incoming incidents are analyzed and assessed by the organization's support teams to fulfill the request. In many organizations, better allocation and effective use of these valuable support resources translate directly into substantial cost savings.
Currently, incidents are created by various stakeholders (business users, IT users, and monitoring tools) within the IT Service Management (ITSM) tool and are assigned to Service Desk teams (L1/L2). These teams review each incident for correct categorization and priority, then carry out an initial diagnosis to see whether they can resolve it. Around ~54% of incidents are resolved by L1/L2 teams. When L1/L2 cannot resolve an incident, they escalate and assign the ticket to functional teams in Applications and Infrastructure (L3). Some incidents are assigned directly to L3 teams by monitoring tools or by callers/requestors. L3 teams carry out a detailed diagnosis and resolve the incidents; around ~56% of incidents are resolved by functional/L3 teams. If vendor support is needed, L3 teams engage the vendor to drive the incident to closure.
L1/L2 teams must spend time reviewing Standard Operating Procedures (SOPs) before assigning incidents to functional teams (a minimum of ~25-30% of incidents require an SOP review before assignment), at roughly 15 minutes per incident. At least ~1 FTE of effort is needed just for assigning incidents to L3 teams. During assignment by L1/L2 teams, there have been multiple instances of incidents routed to the wrong functional group: around ~25% of incidents are wrongly assigned, and functional teams must spend additional effort re-assigning them to the right group. While this happens, some incidents sit in a queue unaddressed, resulting in poor customer service.
This project aims to reduce manual intervention by IT operations and Service Desk teams by automating the ticket allocation process. The goal is to build a text-based ML model that automatically classifies a new ticket into the appropriate assignment group by analyzing its description; the model can later be integrated into any ITSM tool such as ServiceNow. Given a ticket description, our model outputs the probability of assigning it to each of 74 groups.
Build a multi-class classifier that assigns tickets to groups by analyzing their text. AI-guided routing of incidents to the right functional groups can help organizations reduce resolution time and let support staff focus on more productive tasks.
Milestones
- Exploring the given data files
- Understanding the structure of the data
- Finding missing values in the data
- Finding inconsistencies in the data
- Visualizing different patterns
- Visualizing different text features
- Dealing with missing values
- Text preprocessing
- Creating a word vocabulary from the corpus of ticket description text
- Creating tokens as required
- Building a model architecture that can classify tickets
- Trying different model architectures by researching the state of the art for similar tasks
- Training the model
- To deal with long training times, saving the weights so that subsequent training runs can resume without starting from scratch
- Test the model and report results per the evaluation metrics
- Try different models
- Try different evaluation metrics
- Tune hyperparameters: try different optimizers, loss functions, epochs, learning rates, batch sizes, checkpointing, early stopping, etc.
- Report the evaluation metrics for these models, along with observations on how changing different hyperparameters changes the final evaluation metric
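The tuning loop described in the milestones can be sketched with scikit-learn's `GridSearchCV`. The data and parameter grid below are illustrative assumptions, not the project's final configuration; the real search would run over TF-IDF ticket features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative stand-in data; the real project uses TF-IDF ticket features.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=10, n_classes=3, random_state=42)

# Small, assumed parameter grid: regularization strength only.
param_grid = {"C": [0.1, 1.0, 10.0]}

# Cross-validated search scored on (negative) multi-class log loss.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, cv=3, scoring="neg_log_loss")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The same pattern extends to any estimator with hyperparameters; for deep models the analogous sweep would cover optimizers, learning rates, and batch sizes.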
The solution would be implemented using the approach below:
– Using a traditional machine learning algorithm, we classify the tickets into one of the assignment groups that have over 100 tickets.
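In outline, the approach vectorizes each ticket description and feeds it to a classifier that returns a probability per assignment group. The minimal sketch below uses a TF-IDF + logistic regression pipeline on made-up tickets and group names, purely to illustrate the shape of the solution:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny synthetic tickets; group labels are made up for illustration.
tickets = ["password reset for user account",
           "cannot login to outlook account",
           "job scheduler failed overnight batch",
           "batch job abended in scheduler",
           "vpn connection drops frequently",
           "network vpn not connecting"]
groups = ["GRP_0", "GRP_0", "GRP_8", "GRP_8", "GRP_12", "GRP_12"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(tickets, groups)

# The model returns one probability per assignment group for a new ticket.
proba = clf.predict_proba(["user forgot password"])
print(dict(zip(clf.classes_, proba[0].round(3))))
```

The real model is trained on the full corpus after the cleaning and translation steps described below.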
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from IPython.display import display
pd.options.display.max_columns = None
pd.options.display.max_rows = None
! pip install xlrd openpyxl  # recent pandas versions need openpyxl to read .xlsx files
df = pd.read_excel("/content/drive/MyDrive/ColabNotebooks/PGP_AIML_Data/Capstone project/Group_10 - NLP1 Project Common WorkSpace/Input+Data+Synthetic+_28created+but+not+used+in+our+project_29.xlsx")
#Display the data head
df.head()
df.shape
df.info()
df_incidents = df.drop('Caller',axis=1)
df_incidents.head(5)
df_incidents['Assignment group'].unique()
len(df_incidents['Assignment group'].unique())
df_inc = df_incidents['Assignment group'].value_counts().reset_index()
# Note: with pandas < 2.0 reset_index() yields columns ['index', 'Assignment group'];
# pandas >= 2.0 names them ['Assignment group', 'count'] instead.
df_inc['percentage'] = (df_inc['Assignment group']/df_inc['Assignment group'].sum())*100
df_inc.head()
sample = df_incidents.groupby(['Assignment group'])
regroup=[]
for grp in df_incidents['Assignment group'].unique():
    if sample.get_group(grp).shape[0] < 10:
        regroup.append(grp)
print('Found {} groups which have under 10 samples'.format(len(regroup)))
#print("\n\nGroups with less than 10 samples\n\n",len(regroup))
df_incidents['Assignment group']=df_incidents['Assignment group'].apply(lambda x : 'misc_grp' if x in regroup else x)
# Unique Groups check
print("\n\nUnique Groups\n\n",df_incidents['Assignment group'].unique())
# Groups with less than 10 samples
print("\n\nGroups with less than 10 samples\n\n", len(regroup))
# Plot to visualize the percentage data distribution across different groups
sns.set(style="whitegrid")
plt.figure(figsize=(20,5))
ax = sns.countplot(x="Assignment group", data=df_incidents, order=df_incidents["Assignment group"].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
for p in ax.patches:
    ax.annotate(str(format(p.get_height()/len(df_incidents.index)*100, '.2f')+"%"), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'bottom', rotation=90, xytext = (0, 10), textcoords = 'offset points')
df_top_20 = df_incidents['Assignment group'].value_counts().nlargest(20).reset_index()
df_top_20
plt.figure(figsize=(12,6))
bars = plt.bar(df_top_20['index'],df_top_20['Assignment group'])
plt.title('Top 20 Assignment groups with highest number of Tickets')
plt.xlabel('Assignment Group')
plt.xticks(rotation=90)
plt.ylabel('Number of Tickets')
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x(), yval + .005, yval)
plt.tight_layout()
plt.show()
df_bottom_20 = df_incidents['Assignment group'].value_counts().nsmallest(20).reset_index()
df_bottom_20
plt.figure(figsize=(12,6))
bars = plt.bar(df_bottom_20['index'],df_bottom_20['Assignment group'])
plt.title('Bottom 20 Assignment groups with small number of Tickets')
plt.xlabel('Assignment Group')
plt.xticks(rotation=90)
plt.ylabel('Number of Tickets')
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x(), yval + .005, yval)
plt.tight_layout()
plt.show()
df_bins = pd.DataFrame(columns=['Description','Ticket Count'])
one_ticket = {'Description':'1 ticket','Ticket Count':len(df_inc[df_inc['Assignment group'] < 2])}
_2_5_ticket = {'Description':'2-5 ticket',
'Ticket Count':len(df_inc[(df_inc['Assignment group'] > 1)& (df_inc['Assignment group'] < 6) ])}
_10_ticket = {'Description':' 6-10 ticket',
'Ticket Count':len(df_inc[(df_inc['Assignment group'] > 5)& (df_inc['Assignment group'] < 11)])}
_10_20_ticket = {'Description':' 11-20 ticket',
'Ticket Count':len(df_inc[(df_inc['Assignment group'] > 10)& (df_inc['Assignment group'] < 21)])}
_20_50_ticket = {'Description':' 21-50 ticket',
'Ticket Count':len(df_inc[(df_inc['Assignment group'] > 20)& (df_inc['Assignment group'] < 51)])}
_51_100_ticket = {'Description':' 51-100 ticket',
'Ticket Count':len(df_inc[(df_inc['Assignment group'] > 50)& (df_inc['Assignment group'] < 101)])}
_100_ticket = {'Description':' >100 ticket',
'Ticket Count':len(df_inc[(df_inc['Assignment group'] > 100)])}
#append rows to the dataframe (DataFrame.append was removed in pandas 2.0; use pd.concat)
df_bins = pd.concat([df_bins, pd.DataFrame([one_ticket, _2_5_ticket, _10_ticket,
                    _10_20_ticket, _20_50_ticket, _51_100_ticket, _100_ticket])],
                    ignore_index=True)
df_bins
plt.figure(figsize=(10, 8))
plt.pie(df_bins['Ticket Count'],labels=df_bins['Description'],autopct='%1.1f%%', startangle=15, shadow = True);
plt.title('Assignment Groups Distribution')
plt.axis('equal');
- The dataset has 6 assignment groups with only one ticket each.
- There are 15 assignment groups with more than 100 tickets.
- Only ~20% of the assignment groups have more than 100 tickets.
There are missing values in the dataset within the 'Short description' and 'Description' columns; let's view the missing values and impute them.
df_incidents[df_incidents['Short description'].isnull()]
df_incidents[df_incidents['Description'].isnull()]
Replace the NaN values with '' (an empty string). Next, concatenate Short description and Description into a new column called "New_Description" and use it as the prediction input, so that no information about the ticket is lost.
#Replace NaN values in Short Description and Description columns
df_incidents['Short description'] = df_incidents['Short description'].replace(np.nan, '', regex=True)
df_incidents['Description'] = df_incidents['Description'].replace(np.nan, '', regex=True)
df_incidents.info()
#Concatenate Short Description and Description columns
df_incidents['New_Description'] = df_incidents['Short description'] + ' ' +df_incidents['Description']
df_incidents.head()
df_incidents.isnull().sum()
df_incidents_level = df_incidents.copy()
df_incidents_level['Target'] = np.where(df_incidents_level['Assignment group']=='GRP_0','L1/L2',np.where(df_incidents_level['Assignment group'] =='GRP_8','L1/L2','L3'))
df_incidents.head(5)
df_incidents_level.head(5)
df_incidents_level.Target.value_counts()
x=df_incidents_level.Target.value_counts()
sns.barplot(x=x.index, y=x.values)  # seaborn >= 0.12 requires keyword arguments
plt.gca().set_ylabel('samples')
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
text_len=df_incidents_level[df_incidents_level['Target']=='L1/L2']['Short description'].str.len()
ax1.hist(text_len.dropna(),color='yellow')
ax1.set_title('L1/L2')
text_len=df_incidents_level[df_incidents_level['Target']=='L3']['Short description'].str.len()
ax2.hist(text_len.dropna(),color='blue')
ax2.set_title('L3')
fig.suptitle('Characters in short description')
plt.show()
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
text_len=df_incidents_level[df_incidents_level['Target']=='L1/L2']['Short description'].str.split().map(len)  # count words directly
ax1.hist(text_len.dropna(),color='magenta')
ax1.set_title('L1/L2')
text_len=df_incidents_level[df_incidents_level['Target']=='L3']['Short description'].str.split().map(len)  # count words directly
ax2.hist(text_len.dropna(),color='red')
ax2.set_title('L3')
fig.suptitle('Words in short description')
plt.show()
df_incidents_level['Short description']=df_incidents_level['Short description'].apply(str)
def ave_word_len(sentence):
    words = sentence.split(" ")
    return sum(len(word) for word in words) / len(words)
df_incidents_level["short_description_avg_word_len"] = df_incidents_level["Short description"].apply(ave_word_len)
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
word1=df_incidents_level[df_incidents_level['Target']=='L1/L2']['Short description'].str.split().apply(lambda x : [len(i) for i in x])
sns.histplot(word1.map(lambda x: np.mean(x)), ax=ax1, color='yellow', kde=True)  # distplot is deprecated in recent seaborn
ax1.set_title('L1/L2')
#print("\n\n",word,"\n\n")
word2=df_incidents_level[df_incidents_level['Target']=='L3']['Short description'].str.split().apply(lambda x : [len(i) for i in x])
sns.histplot(word2.map(lambda x: np.mean(x)), ax=ax2, color='indigo', kde=True)  # distplot is deprecated in recent seaborn
ax2.set_title('L3')
fig.suptitle('Average word length in each incident');
#print("\n\n",word.size,"\n\n")
df_incidents_level["short_description_nupper1"] = df_incidents_level[df_incidents_level['Target']=='L1/L2']["Short description"].apply((lambda word1: len([x for x in word1.split() if x.isupper()])))
df_incidents_level["short_description_nupper2"] = df_incidents_level[df_incidents_level['Target']=='L3']["Short description"].apply((lambda word2: len([x for x in word2.split() if x.isupper()])))
print("Nulls for L1/L2",df_incidents_level["short_description_nupper1"].isna().sum(),"\n")
print("Nulls for L3",df_incidents_level["short_description_nupper2"].isna().sum(),"\n")
df_incidents_level["short_description_nupper1"] = df_incidents_level["short_description_nupper1"].fillna(0)
df_incidents_level["short_description_nupper2"] = df_incidents_level["short_description_nupper2"].fillna(0)
print("Nulls for L1/L2 after removing nulls",df_incidents_level["short_description_nupper1"].isna().sum(),"\n")
print("Nulls for L3 after removing nulls",df_incidents_level["short_description_nupper2"].isna().sum(),"\n")
df_incidents_level["short_description_nupper"] = df_incidents_level["short_description_nupper1"] + df_incidents_level["short_description_nupper2"]
print("size of the column with both L1/L2 and L3 tickets",df_incidents_level["short_description_nupper"].size)
df_incidents_level["short_description_nupper"].size
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
text_len=df_incidents_level[df_incidents_level['Target']=='L1/L2']['short_description_nupper']
ax1.hist(text_len.dropna(),color='magenta')
ax1.set_title('L1/L2')
text_len=df_incidents_level[df_incidents_level['Target']=='L3']['short_description_nupper']
ax2.hist(text_len.dropna(),color='indigo')
ax2.set_title('L3')
fig.suptitle('Number of upper case words')
plt.show()
df_incidents_level["short_description_ndigits"] = df_incidents_level["Short description"].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
df_incidents_level[["Short description","short_description_ndigits"]].sort_values(by = "short_description_ndigits",ascending = False).head()
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
text_len=df_incidents_level[df_incidents_level['Target']=='L1/L2']['short_description_ndigits']
ax1.hist(text_len.dropna(),color='green')
ax1.set_title('L1/L2')
text_len=df_incidents_level[df_incidents_level['Target']=='L3']['short_description_ndigits']
ax2.hist(text_len.dropna(),color='magenta')
ax2.set_title('L3')
fig.suptitle('No of digits in short description')
plt.show()
df_incidents_level[df_incidents_level['Assignment group']=='GRP_24'].New_Description
df_incidents_level.head()
#Let's strip non-ASCII characters so the text can be passed to the language detection API.
def fn_decode_to_ascii(text):
    return text.encode('ascii', 'ignore').decode('ascii')
df_incidents_level['New_Description'] = df_incidents_level['New_Description'].apply(fn_decode_to_ascii)
! pip install langdetect
from langdetect import detect
def fn_lan_detect(text):
    try:
        return detect(text)
    except Exception:
        # langdetect raises on empty / undetectable text; flag those rows as 'no'
        return 'no'
df_incidents_level['language'] = df_incidents_level['New_Description'].apply(fn_lan_detect)
df_incidents_level["language"].value_counts()
x = df_incidents_level["language"].value_counts()
x=x.sort_index()
plt.figure(figsize=(10,6))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)  # seaborn >= 0.12 requires keyword arguments
plt.title("Distribution of text by language")
plt.ylabel('number of records')
plt.xlabel('Language')
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show();
!pip install googletrans==4.0.0rc1  # pin a known-working build; other googletrans versions frequently break
import googletrans
from googletrans import Translator
print(googletrans.LANGUAGES)
# Function to translate the text to English.
translator = Translator()  # instantiate once; fn_translate below relies on it
def fn_translate(text, lang):
    try:
        if lang == 'en':
            return text
        else:
            return translator.translate(text).text
    except Exception:
        # On API errors (e.g. rate limiting) fall back to the original text.
        return text
df_incidents_level['English_Description'] = df_incidents_level.apply(lambda x: fn_translate(x['New_Description'], x['language']), axis=1)
Note: the Google Translate API is used to translate the German text; however, Google limits the number of requests from a single IP address. The translation was therefore done in batches and saved to a file, which is used for further processing.
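The batching itself can be sketched generically. In the snippet below, `translate_in_batches` and its `translate_fn` argument are hypothetical helpers: `translate_fn` stands in for the real googletrans call, and `pause_sec` is an assumed throttle between batches.

```python
import time

def translate_in_batches(texts, translate_fn, batch_size=50, pause_sec=0.0):
    """Apply a (rate-limited) translation function in fixed-size batches.

    translate_fn is a placeholder for the real API call, e.g. a wrapper
    around googletrans' Translator.translate; pause_sec throttles requests.
    """
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results.extend(translate_fn(t) for t in batch)
        if pause_sec and start + batch_size < len(texts):
            time.sleep(pause_sec)  # back off between batches to respect limits
    return results

# Stub translator used here so the sketch runs offline.
demo = translate_in_batches(["hallo welt", "guten tag"],
                            translate_fn=lambda t: t.upper(), batch_size=1)
print(demo)
```

In the actual run, each batch's output would be appended to the CSV file so that a failed batch can be retried without redoing the rest.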
df_incidents_level[df_incidents_level["Short description"].str.contains("account lock")]["Assignment group"].value_counts()
df_incidents_level[df_incidents_level["Short description"].str.contains("oneteam")]["Assignment group"].value_counts()
df_incidents_level.head(5)
df_incidents_level.to_csv('/content/drive/MyDrive/ColabNotebooks/PGP_AIML_Data/Capstone project/Group_10 - NLP1 Project Common WorkSpace/inc_tranlated.csv')
Before starting any NLP project, we need to pre-process the data into a consistent format: clean it, tokenize it, and convert it into a matrix.
df_tranlated_text = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/PGP_AIML_Data/Capstone project/Group_10 - NLP1 Project Common WorkSpace/inc_tranlated.csv',encoding='utf-8')
df_tranlated_inc = df_tranlated_text.drop(['Short description','Description','New_Description'],axis=1)
df_tranlated_inc.English_Description=df_tranlated_inc.English_Description.astype(str)
df_tranlated_inc.head()
df_tranlated_inc.info()
import string
import re
from collections import Counter
from nltk.corpus import stopwords
### Make text lowercase, remove text in square brackets,remove links,remove punctuation and remove words containing numbers
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links,
    remove punctuation and remove words containing numbers.'''
    # Lowercase first so the literal replacements below match reliably.
    text = text.lower()
    text = text.replace('first name: ', 'firstname')
    text = text.replace('last name: ', 'lastname')
    text = text.replace('received from:', '')
    text = text.replace('email address:', '')
    text = text.replace('email:', '')
    # Strip forwarded-mail headers, guarding against markers that are absent.
    index1 = text.find('from:')
    index2 = text.find('\nsubject:')
    if index1 != -1 and index2 != -1 and index1 < index2:
        text = text.replace(text[index1:index2], '')
    # Strip inline image references such as [cid:image...]
    index3 = text.find('[cid:image')
    index4 = text.find(']', index3)
    if index3 != -1 and index4 != -1:
        text = text.replace(text[index3:index4 + 1], '')
    text = text.replace('subject:', '')
    text = text.replace('monitoring_tool@company.com', 'monitoringtool')
    # Remove common e-mail boilerplate phrases.
    for phrase in ['this message was sent from an unmonitored email address',
                   'please do not reply to this message',
                   'select the following link to view the disclaimer in an alternate language',
                   'description of problem:', 'description problem',
                   'steps taken so far', 'steps taken far',
                   'customer job title', 'sales engineer contact',
                   'please do the needful', 'please note that ', 'please find below',
                   'date and time', 'kindly refer mail',
                   'name:', 'language:', 'customer number:', 'telephone:',
                   'summary:', 'importance:', 'sincerely', 'company inc',
                   'gmail.com', 'company.com', 'microsoftonline.com',
                   'company.onmicrosoft.com',
                   'hi it team', 'hi team', 'hello', 'hallo', 'good morning']:
        text = text.replace(phrase, '')
    # Use word boundaries for short greeting words so that, e.g., 'this'
    # is not mangled by removing the substring 'hi'.
    for word in ['hi', 'best', 'kind', 'regards', 'please']:
        text = re.sub(r'\b%s\b' % word, '', text)
    text = re.sub(r'\S+@\S+', '', text)                 # e-mail addresses
    text = re.sub(r'\w*\d\w*', '', text)                # words containing numbers
    text = re.sub(r'\[.*?\]', '', text)                 # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # links
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'[\r\n]', '', text)                  # line breaks
    text = re.sub(r'\d+', '', text)                     # remaining digits
    return text
df_tranlated_inc['cleaned_description'] = df_tranlated_inc['English_Description'].apply(lambda x: clean_text(x))
df_tranlated_inc.drop(['English_Description'],axis=1,inplace=True)
df_tranlated_inc['cleaned_description'].head()
For almost any natural language processing task, one of the most useful visualization tools is the word cloud. A word cloud (as its name suggests) is an image composed of the individual words of a text, where the size of each word is proportional to its frequency (number of occurrences) in that text. Here, the words can easily be taken from the cleaned description column.
def f_word_cloud(column):
    comment_words = ' '
    stopwords = set(STOPWORDS)
    # iterate over the descriptions
    for val in column:
        # typecast each value to string
        val = str(val)
        # split the value into tokens
        tokens = val.split()
        # convert each token to lowercase
        for i in range(len(tokens)):
            tokens[i] = tokens[i].lower()
        for words in tokens:
            comment_words = comment_words + words + ' '
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          stopwords=stopwords,
                          min_font_size=10).generate(comment_words)
    return wordcloud
from wordcloud import WordCloud, STOPWORDS
wordcloud = f_word_cloud(df_tranlated_inc.cleaned_description)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
wordcloud = f_word_cloud(df_tranlated_inc[df_tranlated_inc['Assignment group']=='GRP_0'].cleaned_description)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
GRP_0 seems to have tickets related to password resets, access problems, login problems, connection problems, etc.
wordcloud = f_word_cloud(df_tranlated_inc[df_tranlated_inc['Assignment group']=='GRP_8'].cleaned_description)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
GRP_8 seems to have tickets related to outages, job failures, the monitoring tool, etc.
wordcloud = f_word_cloud(df_tranlated_inc[df_tranlated_inc['Assignment group']=='GRP_12'].cleaned_description)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
GRP_12 contains tickets related to system issues such as low disk space, and network issues such as timeouts, Citrix problems, and connectivity timeouts.
wordcloud = f_word_cloud(df_tranlated_inc[df_tranlated_inc['Assignment group']=='GRP_24'].cleaned_description)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
GRP_24: tickets are mainly in German and need to be translated to English before being passed to our model.
Now, let's remove the stopwords, i.e., words that occur very frequently but carry little meaning, such as "a", "an", "the", "are", etc.
## Removal of Stop Words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
df_tranlated_inc['cleaned_description'] = df_tranlated_inc['cleaned_description'].apply(lambda x: " ".join(x for x in str(x).split() if x not in stop))
df_tranlated_inc['cleaned_description'].head()
## Lemmatization
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from textblob import Word
df_tranlated_inc['cleaned_description']= df_tranlated_inc['cleaned_description'].apply(lambda x: " ".join([Word(word).lemmatize() for word in str(x).split()]))
df_tranlated_inc['cleaned_description'].head()
df_tranlated_inc['num_wds'] = df_tranlated_inc['cleaned_description'].apply(lambda x: len(x.split()))
df_tranlated_inc['num_wds'].mean()
print(df_tranlated_inc['num_wds'].max())
print(df_tranlated_inc['num_wds'].min())
len(df_tranlated_inc[df_tranlated_inc['num_wds']==0])
df_tranlated_inc= df_tranlated_inc[df_tranlated_inc['num_wds']>1]
print(df_tranlated_inc['num_wds'].max())
print(df_tranlated_inc['num_wds'].min())
def avg_word(sentence):
    words = sentence.split()
    return sum(len(word) for word in words) / len(words)
df_tranlated_inc['avg_word'] = df_tranlated_inc['cleaned_description'].apply(lambda x: avg_word(str(x)))
df_tranlated_inc.head()
ax=df_tranlated_inc['num_wds'].plot(kind='hist', bins=50, fontsize=14, figsize=(12,10))
ax.set_title('Description Length in Words\n', fontsize=20)
ax.set_ylabel('Frequency', fontsize=18)
ax.set_xlabel('Number of Words', fontsize=18);
df_tranlated_inc['uniq_wds'] = df_tranlated_inc['cleaned_description'].str.split().apply(lambda x: len(set(x)))
df_tranlated_inc['uniq_wds'].head()
print(df_tranlated_inc['uniq_wds'].mean())
print(df_tranlated_inc['uniq_wds'].min())
print(df_tranlated_inc['uniq_wds'].max())
ax=df_tranlated_inc['uniq_wds'].plot(kind='hist', bins=50, fontsize=14, figsize=(12,10))
ax.set_title('Unique Words Per Incident\n', fontsize=20)
ax.set_ylabel('Frequency', fontsize=18)
ax.set_xlabel('Number of Unique Words', fontsize=18);
Plotting this as a chart, we can see that while the distribution of unique words is still skewed, it looks broadly similar to the distribution of total word counts generated earlier.
assign_grps = df_tranlated_inc.groupby('Assignment group')
ax=assign_grps['num_wds'].aggregate(np.mean).plot(kind='bar', fontsize=14, figsize=(20,10))
ax.set_title('Mean Number of Words in tickets per Assignment Group\n', fontsize=20)
ax.set_ylabel('Mean Number of Words', fontsize=18)
ax.set_xlabel('Assignment Group', fontsize=18);
ax=assign_grps['uniq_wds'].aggregate(np.mean).plot(kind='bar', fontsize=14, figsize=(20,10))
ax.set_title('Mean Number of Unique Words per tickets in Assignment Group\n', fontsize=20)
ax.set_ylabel('Mean Number of Unique Words', fontsize=18)
ax.set_xlabel('Assignment Group', fontsize=18);
Finally, let’s look at the most common words over the entire corpus.
wd_counts = Counter()
for i, row in df_tranlated_inc.iterrows():
    wd_counts.update(row['cleaned_description'].split())
wd_counts.most_common(20)
Above, we can see some fairly predictable common words.
Tokenization is the process of splitting an input sequence into tokens, where a token can be a word, a sentence, a paragraph, etc.
import nltk
# Tokenizing the training and the test set
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
df_tranlated_inc['token_desc'] = df_tranlated_inc['cleaned_description'].apply(lambda x: tokenizer.tokenize(x))
df_tranlated_inc['token_desc'].head()
# After preprocessing, the text format
def combine_text(list_of_text):
    '''Takes a list of text and combines it into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text
df_tranlated_inc['token_desc'] = df_tranlated_inc['token_desc'].apply(lambda x : combine_text(x))
df_tranlated_inc.info()
df_tranlated_inc.to_csv("/content/drive/MyDrive/ColabNotebooks/PGP_AIML_Data/Capstone project/Group_10 - NLP1 Project Common WorkSpace/cleanedData.csv")
df_tranlated_inc1 = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/PGP_AIML_Data/Capstone project/Group_10 - NLP1 Project Common WorkSpace/cleanedData.csv',encoding='utf-8')
df_tranlated_inc1.head()
df_tranlated_inc1.columns
df_tranlated_inc2= df_tranlated_inc1.drop(['Unnamed: 0','Unnamed: 0.1','short_description_avg_word_len', 'short_description_nupper','short_description_ndigits'], axis=1)
df_tranlated_inc2.head(5)
df_tranlated_inc2.to_csv("/content/drive/MyDrive/ColabNotebooks/PGP_AIML_Data/Capstone project/Group_10 - NLP1 Project Common WorkSpace/cleanedData.csv")
df_tranlated_inc2.columns
from sklearn.feature_extraction.text import TfidfVectorizer
# word level tf-idf for ticket
tfidf = TfidfVectorizer(max_features=250, analyzer = 'word', min_df=2, max_df=0.95, ngram_range=(1, 2))
inc_tfidf = tfidf.fit_transform(df_tranlated_inc2['token_desc'])
len(inc_tfidf.todense())
# create a dictionary mapping the tokens to their idf values
# (get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out)
tfidf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
tfidf = pd.DataFrame.from_dict(tfidf, orient='index')
tfidf.columns = ['tfidf']
Below are the 10 tokens with the lowest tf-idf scores; unsurprisingly, these are very generic words that cannot be used to distinguish one description from another.
tfidf.sort_values(by=['tfidf'], ascending=True).head(10)
Below are the tokens with the highest tf-idf scores, which include words specific enough that, just by looking at them, we could guess the categories they belong to:
tfidf.sort_values(by=['tfidf'], ascending=False).head(20)
plt.figure(figsize=(15,7))
sns.histplot(tfidf["tfidf"], kde=True)  # distplot is deprecated in recent seaborn
Given the high dimensionality of our tf-idf matrix, we first reduce it using truncated Singular Value Decomposition (SVD). To visualize the vocabulary, we then use t-SNE to reduce the dimensionality to 2; t-SNE is well suited for reduction to 2 or 3 dimensions.
from sklearn.decomposition import TruncatedSVD
n_comp=10
svd = TruncatedSVD(n_components=n_comp, random_state=42)
svd_tfidf = svd.fit_transform(inc_tfidf)
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=42, n_iter=500)
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)
tfidf_df = pd.DataFrame(tsne_tfidf, columns=['x', 'y'])
plt.scatter(tfidf_df.x, tfidf_df.y, alpha=0.7)
We can see multiple smaller clusters here; each cluster could correspond to a type of ticket present in the dataset.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
# create count vectorizer first
cvectorizer = CountVectorizer(min_df=4, max_features=4000, ngram_range=(1,2))
cvz = cvectorizer.fit_transform(df_tranlated_inc2['token_desc'])
# generate topic models using Latent Dirichlet Allocation
lda_model = LatentDirichletAllocation(n_components=10, learning_method='online', max_iter=20, random_state=42)
X_topics = lda_model.fit_transform(cvz)
n_top_words = 10
topic_summaries = []
# get topics and topic terms
topic_word = lda_model.components_
vocab = cvectorizer.get_feature_names_out()  # get_feature_names was removed in scikit-learn 1.2
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' | '.join(topic_words)))
We can see that password-related tickets are grouped under topic 3, account-related tickets under topic 7, job-scheduler-related tickets under topic 8, etc.
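Each ticket's dominant topic comes straight out of the document-topic matrix returned by `fit_transform`: the argmax of a ticket's row is its most probable topic. A minimal sketch on a tiny made-up corpus (the real run uses the cleaned ticket descriptions and 10 topics):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus; the real run uses the cleaned ticket descriptions.
docs = ["password reset account", "reset password locked",
        "job scheduler failed", "scheduler job error"]
cv = CountVectorizer()
X = cv.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(X)   # one topic-probability row per document

# The dominant topic of each ticket is the argmax of its row.
dominant = doc_topics.argmax(axis=1)
print(dominant)
```

Grouping tickets by their dominant topic is how statements like "password tickets fall under topic 3" are read off the model.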
# collect the tfid matrix in numpy array
array = inc_tfidf.todense()
# store the tf-idf array into pandas dataframe
df_inc = pd.DataFrame(array)
df_inc.head(10)
df_tranlated_inc2.head()
df_inc['num_wds']= df_tranlated_inc2['num_wds']
df_inc['avg_word']= df_tranlated_inc2['avg_word']
df_inc['Assignment group']= df_tranlated_inc2['Assignment group']
df_inc.head()
features = df_inc.columns.tolist()
output = 'Assignment group'
# remove the output column from the feature list
features.remove(output)
df_inc_sample = df_inc[df_inc['Assignment group'].map(df_inc['Assignment group'].value_counts()) > 100]
df_inc_sample.shape
df_inc_sample['Assignment group'].value_counts()
def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi-class version of the Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary (one-hot) array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2
    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
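As a sanity check, this helper computes the same quantity as scikit-learn's built-in log_loss (the mean negative log of the probability assigned to the true class), which can be verified on a small example:

```python
import numpy as np
from sklearn.metrics import log_loss

actual = np.array([0, 1, 2])
predicted = np.array([[0.8, 0.1, 0.1],
                      [0.2, 0.7, 0.1],
                      [0.1, 0.2, 0.7]])
# mean of -log(probability assigned to the true class)
expected = -np.mean(np.log([0.8, 0.7, 0.7]))
print(round(log_loss(actual, predicted, labels=[0, 1, 2]), 6))
```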
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
df_inc_sample = df_tranlated_inc2[df_tranlated_inc2['Assignment group'].map(df_tranlated_inc2['Assignment group'].value_counts()) > 100]
x = df_inc_sample['token_desc']
y = df_inc_sample['Assignment group']
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
# encoding train labels
encoder.fit(y)
y = encoder.transform(y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=13,stratify=y)
log_cols=["Classifier", "accuracy","f1_score"]
log = pd.DataFrame(columns=log_cols)
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
nb = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB()),
])
nb.fit(X_train, y_train)
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.metrics import classification_report
y_pred = nb.predict(X_test)
predictions = nb.predict_proba(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print('f1 score %s' % f1_score(y_pred, y_test,average='weighted'))
print ("logloss: %0.3f " % multiclass_logloss(y_test,predictions))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))
log_entry = pd.DataFrame([["MultinomialNB",accuracy_score(y_pred, y_test),f1_score(y_pred, y_test,average='weighted')]], columns=log_cols)
log = pd.concat([log, log_entry], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
from sklearn.svm import LinearSVC
svc = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC(loss='hinge',random_state=42))),
])
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print('f1 score %s' % f1_score(y_pred, y_test,average='weighted'))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))
log_entry = pd.DataFrame([["LinearSVC",accuracy_score(y_pred, y_test),f1_score(y_pred, y_test,average='weighted')]], columns=log_cols)
log = pd.concat([log, log_entry], ignore_index=True)
from sklearn.linear_model import SGDClassifier
sgd = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=5, tol=None)),
])
sgd.fit(X_train, y_train)
y_pred = sgd.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print('f1 score %s' % f1_score(y_pred, y_test,average='weighted'))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))
log_entry = pd.DataFrame([["SGDClassifier",accuracy_score(y_pred, y_test),f1_score(y_pred, y_test,average='weighted')]], columns=log_cols)
log = pd.concat([log, log_entry], ignore_index=True)
from sklearn.linear_model import LogisticRegression
logreg = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', LogisticRegression(n_jobs=1, C=1e5)),
])
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
predictions = logreg.predict_proba(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print('f1 score %s' % f1_score(y_pred, y_test,average='weighted'))
print ("logloss: %0.3f " % multiclass_logloss(y_test,predictions))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))
log_entry = pd.DataFrame([["LogisticRegression",accuracy_score(y_pred, y_test),f1_score(y_pred, y_test,average='weighted')]], columns=log_cols)
log = pd.concat([log, log_entry], ignore_index=True)
Random Forest
from sklearn.ensemble import RandomForestClassifier
rvc = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', RandomForestClassifier(n_estimators=100)),
])
rvc.fit(X_train, y_train)
y_pred = rvc.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print('f1 score %s' % f1_score(y_pred, y_test,average='weighted'))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))
log_entry = pd.DataFrame([["RandomForestClassifier",accuracy_score(y_pred, y_test),f1_score(y_pred, y_test,average='weighted')]], columns=log_cols)
log = pd.concat([log, log_entry], ignore_index=True)
import xgboost as xgb
xgboost = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
subsample=0.8, nthread=10, learning_rate=0.1)),
])
xgboost.fit(X_train, y_train)
y_pred = xgboost.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print('f1 score %s' % f1_score(y_pred, y_test,average='weighted'))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))
log_entry = pd.DataFrame([["Xgboost",accuracy_score(y_pred, y_test),f1_score(y_pred, y_test,average='weighted')]], columns=log_cols)
log = pd.concat([log, log_entry], ignore_index=True)
log.set_index(["Classifier"],inplace=True)
log.sort_values(by=['f1_score'])
log.sort_values(by=['f1_score']).plot(kind='barh',figsize=[7,6])
The data contains a lot of noise; for example, some tickets related to account setup are spread across multiple assignment groups. We therefore cleaned and preprocessed the data.
Tokenization: Tokenization is the process of converting raw text strings into a list of tokens, i.e. the words we actually want. A sentence tokenizer splits text into a list of sentences, while a word tokenizer splits strings into a list of words.
We then used the cleaned and preprocessed dataset to run a basic benchmark model.
Because the dataset is highly skewed, we only considered a subset of groups for predictions. Of the 74 groups, a single group accounts for 46 percent of the tickets and only 16 groups have more than 100 tickets; the remaining assignment groups have very low ticket counts that would add little value to the model. Random sampling across all subcategories would risk dropping every ticket from some categories.
As a result, we considered only the groups with more than 100 tickets.
Even so, the predictions appear biased towards GRP 0, which holds the majority of the samples.
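One inexpensive mitigation worth trying for this imbalance (a sketch, not something evaluated here) is class reweighting: most scikit-learn classifiers accept class_weight='balanced', which weights each class inversely to its frequency. On toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy tickets: "GRP_0" is the majority class, as in the real data
texts = ["reset password", "password locked", "password expired",
         "vpn tunnel down", "printer jam", "printer offline"]
groups = ["GRP_0", "GRP_0", "GRP_0", "GRP_1", "GRP_2", "GRP_2"]
svc = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # 'balanced' weights each class by n_samples / (n_classes * class_count)
    ('clf', LinearSVC(class_weight='balanced', random_state=42)),
])
svc.fit(texts, groups)
print(svc.predict(["user forgot password"])[0])
```

On the real dataset this would reduce the pull towards GRP 0 at some cost to overall accuracy; resampling (e.g. oversampling minority groups) is the other common option.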
from google.colab import drive
drive.mount('/content/drive/')
projectpath = '/content/drive/My Drive/DataForColab/'
file_name ='cleanedData.csv'
df=pd.read_csv(projectpath + file_name)
df.head()
df_cleaned = pd.DataFrame()
df_cleaned['Assignment group'] = df['Assignment group']
df_cleaned['cleaned_description'] = df['cleaned_description']
# merge groups with fewer than 10 tickets into one misc_grp
sample = df_cleaned.groupby(['Assignment group'])
regroup=[]
for grp in df_cleaned['Assignment group'].unique():
    if sample.get_group(grp).shape[0] < 10:
        regroup.append(grp)
print('Found {} groups which have under 10 samples'.format(len(regroup)))
df_cleaned['Assignment group']=df_cleaned['Assignment group'].apply(lambda x : 'misc_grp' if x in regroup else x)
# Unique Groups in cleaned dataset
df_cleaned['Assignment group'].unique()
# Lemmatization and stop word removal
from nltk.corpus import stopwords
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
sr = stopwords.words('english')
# remove stop words (vectorized apply avoids chained-assignment warnings)
df_cleaned['cleaned_description'] = df_cleaned['cleaned_description'].apply(
    lambda text: " ".join(word for word in text.split(' ') if word not in sr))
# install spaCy for lemmatization
!pip install -q spacy
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
def lemmatize_text(text):
    doc = nlp(text)
    return ' '.join(token.lemma_ for token in doc)
df_cleaned['cleaned_description'] = df_cleaned['cleaned_description'].apply(lemmatize_text)
# Label encoding
from sklearn import preprocessing
def labelencoder(dataframe):
    label_encoder = preprocessing.LabelEncoder()
    dataframe = label_encoder.fit_transform(dataframe)
    grp_mapping = dict(zip(label_encoder.transform(label_encoder.classes_), label_encoder.classes_))
    return dataframe, grp_mapping
df_cleaned['Assignment group'],grp_mapping_all_raw = labelencoder(df_cleaned['Assignment group'])
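The returned grp_mapping lets us translate encoded predictions back into group names later; a small illustration (the group names here are just examples):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
codes = le.fit_transform(["GRP_0", "GRP_8", "GRP_0", "GRP_24"])
# map encoded labels back to the original group names
grp_mapping = dict(zip(le.transform(le.classes_), le.classes_))
print(codes.tolist())         # classes are sorted lexicographically before encoding
print(grp_mapping[codes[1]])  # decode a single prediction
```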
Modelling
from gensim.models import Word2Vec
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D,GRU,Conv1D,MaxPooling1D
from tensorflow.keras.models import Model, Sequential
import tensorflow as tf
from sklearn import metrics
from tensorflow.keras import backend as K
import matplotlib.pyplot as plt
from tensorflow.keras.utils import plot_model
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Add Function to capture results from each model
import operator
def captureData(dataframe, modelHistory, modelName, descriptions, index_df, resetData):
    if resetData == 1:
        tempResultsDf = pd.DataFrame()
        dataframe = pd.DataFrame()
    else:
        # pick the epoch with the best validation accuracy
        index, acc_value = max(enumerate(modelHistory.history['val_accuracy']), key=operator.itemgetter(1))
        tempResultsDf = pd.DataFrame(
            {'model': [modelName],
             'val_accuracy': [acc_value],
             'val_loss': [modelHistory.history['val_loss'][index]],
             'loss': [modelHistory.history['loss'][index]],
             'accuracy': [modelHistory.history['accuracy'][index]],
             'descriptions': [descriptions]}, index=[str(index_df)])
        dataframe = pd.concat([dataframe, tempResultsDf])
        dataframe = dataframe[['model', 'val_accuracy', 'val_loss', 'loss', 'accuracy', 'descriptions']]
    return dataframe
def capturePrediction(dataframe, modelName, descriptions, index_df, pred_accuracy, pred_F1, resetData):
    if resetData == 1:
        tempResultsDf = pd.DataFrame()
        dataframe = pd.DataFrame()
    else:
        tempResultsDf = pd.DataFrame(
            {'model': [modelName],
             'Pred_Accuracy': [pred_accuracy],
             'Pred_F1': [pred_F1],
             'descriptions': [descriptions]}, index=[str(index_df)])
        dataframe = pd.concat([dataframe, tempResultsDf])
        dataframe = dataframe[['model', 'Pred_Accuracy', 'Pred_F1', 'descriptions']]
    return dataframe
GloVe Embedding
glove_file = projectpath + "glove.6B.zip"
print(glove_file)
from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
    z.extractall()
EMBEDDING_FILE = './glove.6B.100d.txt'
embeddings_glove = {}
for o in open(EMBEDDING_FILE):
    word = o.split(" ")[0]
    embd = o.split(" ")[1:]
    embd = np.asarray(embd, dtype='float32')
    embeddings_glove[word] = embd
# defining variables for model building
maxlen = 300
numWords=9000
epochs = 10
class LstmGloveModel:
    model = Model()
    X_test = []
    y_test = []
    embedding_matrix = []

    def wordTokenizer(self, dataframe):
        tokenizer = Tokenizer(num_words=numWords, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False)
        tokenizer.fit_on_texts(dataframe)
        dataframe = tokenizer.texts_to_sequences(dataframe)
        return tokenizer, dataframe

    def splitData(self, X, y):
        print("Number of Samples:", len(X))
        print("Number of Labels: ", len(y))
        X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.2, random_state=10)
        # carve the validation set out of the training split so it cannot overlap the test set
        X_train, X_Val, y_train, y_Val = train_test_split(X_train, y_train, test_size=0.2, random_state=10)
        print("Number of train Samples:", len(X_train))
        print("Number of val Samples:", len(X_Val))
        return X_train, self.X_test, y_train, self.y_test, X_Val, y_Val

    def tokenizeAndEmbedding(self, dataframe):
        tokenizer, X = self.wordTokenizer(dataframe['cleaned_description'])
        y = np.asarray(dataframe['Assignment group'])
        X = pad_sequences(X, maxlen=maxlen)
        self.embedding_matrix = np.zeros((numWords + 1, 100))
        for i, word in tokenizer.index_word.items():
            if i < numWords + 1:
                embedding_vector = embeddings_glove.get(word)
                if embedding_vector is not None:
                    self.embedding_matrix[i] = embedding_vector
        return X, y

    def train(self, dataframe, batch_size, epochs):
        X, y = self.tokenizeAndEmbedding(dataframe)
        X_train, _, y_train, _, X_Val, y_Val = self.splitData(X, y)
        model_history = self.fitModel(X_train, y_train, X_Val, y_Val, batch_size, epochs)
        return model_history

    def fitModel(self, X_train, y_train, X_Val, y_Val, batch_size, epochs):
        input_layer = Input(shape=(maxlen,), dtype=tf.int64)
        embed = Embedding(numWords + 1, output_dim=100, input_length=maxlen, weights=[self.embedding_matrix], trainable=True)(input_layer)
        lstm = Bidirectional(LSTM(128))(embed)
        drop = Dropout(0.3)(lstm)
        dense = Dense(100, activation='relu')(drop)
        out = Dense(len(pd.Series(y_train).unique()), activation='softmax')(dense)
        self.model = Model(input_layer, out)
        self.model.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
        self.model.summary()
        checkpoint = ModelCheckpoint('model-{epoch:03d}-{val_accuracy:03f}.h5', verbose=1, monitor='val_accuracy', save_best_only=True, mode='auto')
        reduceLoss = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=0.0001)
        model_history = self.model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, callbacks=[checkpoint, reduceLoss], validation_data=(X_Val, y_Val))
        return model_history, self.model

    def prediction(self):
        pred = self.model.predict(self.X_test)
        pred = [i.argmax() for i in pred]
        accuracy = metrics.accuracy_score(self.y_test, pred)
        print("Accuracy of the model :", accuracy)
        f1_score = metrics.f1_score(self.y_test, pred, average='weighted')
        print("F1 score of the model :", f1_score)
        return accuracy, f1_score

    def plotModelAccuracy(self, history, modelname):
        plt.plot(history.history['accuracy'])
        plt.plot(history.history['val_accuracy'])
        plt.title(modelname + ' model accuracy')
        plt.ylabel('accuracy')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title(modelname + ' model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()
# Raw data + Glove + LSTM bidirectional
lstmModelRawData = LstmGloveModel()
lstmModelRawData_history, model = lstmModelRawData.train(df_cleaned,100,epochs)
lstm_raw_accuracy , lstm_raw_F1 = lstmModelRawData.prediction()
lstmModelRawData.plotModelAccuracy(lstmModelRawData_history, 'Raw data + Glove + LSTM bidirectional')
Capture results in dataframe
results=pd.DataFrame()
pred_results = pd.DataFrame()
results=captureData(results,lstmModelRawData_history,'LSTM model_GloVe_rawdata','LSTM+GloVe Embedding on raw data','1',0)
pred_results= capturePrediction(pred_results,'LSTM model_GloVe_rawdata','LSTM+GloVe Embedding on raw data','1',lstm_raw_accuracy,lstm_raw_F1,0)
results
pred_results
class GruGloveModel:
    model = Model()
    X_test = []
    y_test = []
    embedding_matrix = []

    def wordTokenizer(self, dataframe):
        tokenizer = Tokenizer(num_words=numWords, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False)
        tokenizer.fit_on_texts(dataframe)
        dataframe = tokenizer.texts_to_sequences(dataframe)
        return tokenizer, dataframe

    def splitData(self, X, y):
        print("Number of Samples:", len(X))
        print("Number of Labels: ", len(y))
        X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.2, random_state=10)
        # carve the validation set out of the training split so it cannot overlap the test set
        X_train, X_Val, y_train, y_Val = train_test_split(X_train, y_train, test_size=0.2, random_state=10)
        print("Number of train Samples:", len(X_train))
        print("Number of val Samples:", len(X_Val))
        return X_train, self.X_test, y_train, self.y_test, X_Val, y_Val

    def tokenizeAndEmbedding(self, dataframe):
        tokenizer, X = self.wordTokenizer(dataframe['cleaned_description'])
        y = np.asarray(dataframe['Assignment group'])
        X = pad_sequences(X, maxlen=maxlen)
        self.embedding_matrix = np.zeros((numWords + 1, 100))
        for i, word in tokenizer.index_word.items():
            if i < numWords + 1:
                embedding_vector = embeddings_glove.get(word)
                if embedding_vector is not None:
                    self.embedding_matrix[i] = embedding_vector
        return X, y

    def train(self, dataframe, batch_size, epochs):
        X, y = self.tokenizeAndEmbedding(dataframe)
        X_train, _, y_train, _, X_Val, y_Val = self.splitData(X, y)
        model_history = self.fitModel(X_train, y_train, X_Val, y_Val, batch_size, epochs)
        return model_history

    def fitModel(self, X_train, y_train, X_Val, y_Val, batch_size, epochs):
        input_layer = Input(shape=(maxlen,), dtype=tf.int64)
        embed = Embedding(numWords + 1, output_dim=100, input_length=maxlen, weights=[self.embedding_matrix], trainable=True)(input_layer)
        gru = GRU(128)(embed)
        drop = Dropout(0.3)(gru)
        dense = Dense(100, activation='relu')(drop)
        out = Dense(len(pd.Series(y_train).unique()), activation='softmax')(dense)
        self.model = Model(input_layer, out)
        self.model.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
        self.model.summary()
        checkpoint = ModelCheckpoint('model-{epoch:03d}-{val_accuracy:03f}.h5', verbose=1, monitor='val_accuracy', save_best_only=True, mode='auto')
        reduceLoss = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=0.0001)
        model_history = self.model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, callbacks=[checkpoint, reduceLoss], validation_data=(X_Val, y_Val))
        return model_history, self.model

    def prediction(self):
        pred = self.model.predict(self.X_test)
        pred = [i.argmax() for i in pred]
        accuracy = metrics.accuracy_score(self.y_test, pred)
        print("Accuracy of the model :", accuracy)
        f1_score = metrics.f1_score(self.y_test, pred, average='weighted')
        print("F1 score of the model :", f1_score)
        return accuracy, f1_score

    def plotModelAccuracy(self, history, modelname):
        plt.plot(history.history['accuracy'])
        plt.plot(history.history['val_accuracy'])
        plt.title(modelname + ' model accuracy')
        plt.ylabel('accuracy')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title(modelname + ' model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()

    def plotModel(self):
        self.model.summary()
# Raw data + Glove + GRU
gruModelRawData = GruGloveModel()
gruModelRawData_history, model = gruModelRawData.train(df_cleaned,100,epochs)
gruRaw_accuracy, gruRaw_f1 = gruModelRawData.prediction()
gruModelRawData.plotModelAccuracy(gruModelRawData_history, 'Raw data + Glove + GRU')
Capture results in dataframe
results=captureData(results,gruModelRawData_history,'GRU model_GloVe_rawdata','GRU+GloVe Embedding on raw data','2',0)
pred_results= capturePrediction(pred_results,'GRU model_GloVe_rawdata','GRU+GloVe Embedding on raw data','2',gruRaw_accuracy,gruRaw_f1,0)
results
pred_results
class RNNGloveModel:
    X_test = []
    y_test = []
    embedding_matrix = []

    def wordTokenizer(self, dataframe):
        tokenizer = Tokenizer(num_words=numWords, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False)
        tokenizer.fit_on_texts(dataframe)
        dataframe = tokenizer.texts_to_sequences(dataframe)
        return tokenizer, dataframe

    def tokenizeAndEmbedding(self, dataframe):
        tokenizer, X = self.wordTokenizer(dataframe['cleaned_description'])
        y = np.asarray(dataframe['Assignment group'])
        X = pad_sequences(X, maxlen=maxlen)
        self.embedding_matrix = np.zeros((numWords + 1, 100))
        for i, word in tokenizer.index_word.items():
            if i < numWords + 1:
                embedding_vector = embeddings_glove.get(word)
                if embedding_vector is not None:
                    self.embedding_matrix[i] = embedding_vector
        return X, y

    def splitData(self, X, y):
        print("Number of Samples:", len(X))
        print("Number of Labels: ", len(y))
        X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.2, random_state=10)
        # carve the validation set out of the training split so it cannot overlap the test set
        X_train, X_Val, y_train, y_Val = train_test_split(X_train, y_train, test_size=0.2, random_state=10)
        print("Number of train Samples:", len(X_train))
        print("Number of val Samples:", len(X_Val))
        return X_train, self.X_test, y_train, self.y_test, X_Val, y_Val

    def train(self, dataframe, batch_size, epochs):
        X, y = self.tokenizeAndEmbedding(dataframe)
        X_train, _, y_train, _, X_Val, y_Val = self.splitData(X, y)
        embed = Embedding(numWords + 1, output_dim=100, input_length=maxlen, weights=[self.embedding_matrix], trainable=True)
        model = Sequential()
        model.add(Input(shape=(maxlen,), dtype=tf.int64))
        model.add(embed)
        model.add(Conv1D(100, 10, activation='relu'))
        model.add(MaxPooling1D(pool_size=2))
        model.add(Dropout(0.3))
        model.add(Conv1D(100, 10, activation='relu'))
        model.add(MaxPooling1D(pool_size=2))
        model.add(Bidirectional(LSTM(128)))
        model.add(Dropout(0.3))
        model.add(Dense(100, activation='relu'))
        model.add(Dense(len(pd.Series(y_train).unique()), activation='softmax'))
        model.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
        model.summary()
        plot_model(model, to_file="RNN.jpg")
        checkpoint = ModelCheckpoint('model-{epoch:03d}-{val_accuracy:03f}.h5', verbose=1, monitor='val_accuracy', save_best_only=True, mode='auto')
        reduceLoss = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=0.0001)
        model_history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, callbacks=[checkpoint, reduceLoss], validation_data=(X_Val, y_Val))
        return model_history, model

    def prediction(self, model):
        pred = model.predict(self.X_test)
        pred = [i.argmax() for i in pred]
        accuracy = metrics.accuracy_score(self.y_test, pred)
        print("Accuracy of the model :", accuracy)
        f1_score = metrics.f1_score(self.y_test, pred, average='weighted')
        print("F1 score of the model :", f1_score)
        return accuracy, f1_score

    def plotModelAccuracy(self, history, modelname):
        plt.plot(history.history['accuracy'])
        plt.plot(history.history['val_accuracy'])
        plt.title(modelname + ' model accuracy')
        plt.ylabel('accuracy')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title(modelname + ' model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()
RNNModelRawData = RNNGloveModel()
RNNModelRawData_history, RnnModel = RNNModelRawData.train(df_cleaned,100,epochs)
RNNModelRawData_accuracy , RNNModelRawData_f1 = RNNModelRawData.prediction(RnnModel)
RNNModelRawData.plotModelAccuracy(RNNModelRawData_history, 'Raw data + Glove + RNN')
results=captureData(results,RNNModelRawData_history,'RNN model_GloVe_rawdata','RNN+GloVe Embedding on raw data','3',0)
pred_results= capturePrediction(pred_results,'RNN model_GloVe_rawdata','RNN+GloVe Embedding on raw data','3',RNNModelRawData_accuracy,RNNModelRawData_f1,0)
results
pred_results
sentences = [line.split(' ') for line in df_cleaned['cleaned_description']]
word2vec = Word2Vec(sentences=sentences,min_count=1)
word2vec.wv.save_word2vec_format(projectpath+ 'word2vec_vector.txt')
# load the whole embedding into memory
embeddings_index = dict()
emb = open(projectpath + 'word2vec_vector.txt')
next(emb)  # skip the "<vocab_size> <dimensions>" header line written by save_word2vec_format
for line in emb:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
emb.close()
print('Loaded %s word vectors.' % len(embeddings_index))
class LstmWord2VecModel:
    model = Model()
    X_test = []
    y_test = []
    embedding_matrix = []

    def wordTokenizer(self, dataframe):
        tokenizer = Tokenizer(num_words=numWords, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False)
        tokenizer.fit_on_texts(dataframe)
        dataframe = tokenizer.texts_to_sequences(dataframe)
        return tokenizer, dataframe

    def splitData(self, X, y):
        print("Number of Samples:", len(X))
        print("Number of Labels: ", len(y))
        X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.2, random_state=10)
        # carve the validation set out of the training split so it cannot overlap the test set
        X_train, X_Val, y_train, y_Val = train_test_split(X_train, y_train, test_size=0.2, random_state=10)
        print("Number of train Samples:", len(X_train))
        print("Number of val Samples:", len(X_Val))
        return X_train, self.X_test, y_train, self.y_test, X_Val, y_Val

    def tokenizeAndEmbedding(self, dataframe):
        tokenizer, X = self.wordTokenizer(dataframe['cleaned_description'])
        y = np.asarray(dataframe['Assignment group'])
        X = pad_sequences(X, maxlen=maxlen)
        self.embedding_matrix = np.zeros((numWords + 1, 100))
        for i, word in tokenizer.index_word.items():
            if i < numWords + 1:
                embedding_vector = embeddings_index.get(word)
                if embedding_vector is not None:
                    self.embedding_matrix[i] = embedding_vector
        return X, y

    def train(self, dataframe, batch_size, epochs):
        X, y = self.tokenizeAndEmbedding(dataframe)
        X_train, _, y_train, _, X_Val, y_Val = self.splitData(X, y)
        model_history = self.fitModel(X_train, y_train, X_Val, y_Val, batch_size, epochs)
        return model_history

    def fitModel(self, X_train, y_train, X_Val, y_Val, batch_size, epochs):
        input_layer = Input(shape=(maxlen,), dtype=tf.int64)
        embed = Embedding(numWords + 1, output_dim=100, input_length=maxlen, weights=[self.embedding_matrix], trainable=True)(input_layer)
        lstm = Bidirectional(LSTM(128))(embed)
        drop = Dropout(0.3)(lstm)
        dense = Dense(100, activation='relu')(drop)
        out = Dense(len(pd.Series(y_train).unique()), activation='softmax')(dense)
        self.model = Model(input_layer, out)
        self.model.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
        checkpoint = ModelCheckpoint('model-{epoch:03d}-{val_accuracy:03f}.h5', verbose=1, monitor='val_accuracy', save_best_only=True, mode='auto')
        reduceLoss = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=0.0001)
        model_history = self.model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, callbacks=[checkpoint, reduceLoss], validation_data=(X_Val, y_Val))
        return model_history, self.model

    def prediction(self):
        pred = self.model.predict(self.X_test)
        pred = [i.argmax() for i in pred]
        accuracy = metrics.accuracy_score(self.y_test, pred)
        print("Accuracy of the model :", accuracy)
        f1_score = metrics.f1_score(self.y_test, pred, average='weighted')
        print("F1 score of the model :", f1_score)
        return accuracy, f1_score

    def plotModelAccuracy(self, history, modelname):
        plt.plot(history.history['accuracy'])
        plt.plot(history.history['val_accuracy'])
        plt.title(modelname + ' model accuracy')
        plt.ylabel('accuracy')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title(modelname + ' model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()
# Raw data + Word2Vec + LSTM bidirectional
lstmModelRawData = LstmWord2VecModel()
lstmModelRawData_history, model = lstmModelRawData.train(df_cleaned,100,epochs)
lstm_raw_accuracy , lstm_raw_f1 = lstmModelRawData.prediction()
lstmModelRawData.plotModelAccuracy(lstmModelRawData_history, 'Raw data + Word2Vec + LSTM bidirectional')
Capture results in dataframe
results=captureData(results,lstmModelRawData_history,'LSTM model_Word2Vec_rawdata','LSTM+Word2Vec Embedding on raw data','4',0)
pred_results= capturePrediction(pred_results,'LSTM model_Word2Vec_rawdata','LSTM+Word2Vec Embedding on raw data','4',lstm_raw_accuracy,lstm_raw_f1,0)
pred_results
class GruWord2VecModel:
    model = Model()
    X_test = []
    y_test = []
    embedding_matrix = []

    def wordTokenizer(self, dataframe):
        tokenizer = Tokenizer(num_words=numWords, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False)
        tokenizer.fit_on_texts(dataframe)
        dataframe = tokenizer.texts_to_sequences(dataframe)
        return tokenizer, dataframe

    def splitData(self, X, y):
        print("Number of Samples:", len(X))
        print("Number of Labels: ", len(y))
        X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.2, random_state=10)
        # carve the validation set out of the training split so it cannot overlap the test set
        X_train, X_Val, y_train, y_Val = train_test_split(X_train, y_train, test_size=0.2, random_state=10)
        print("Number of train Samples:", len(X_train))
        print("Number of val Samples:", len(X_Val))
        return X_train, self.X_test, y_train, self.y_test, X_Val, y_Val

    def tokenizeAndEmbedding(self, dataframe):
        tokenizer, X = self.wordTokenizer(dataframe['cleaned_description'])
        y = np.asarray(dataframe['Assignment group'])
        X = pad_sequences(X, maxlen=maxlen)
        self.embedding_matrix = np.zeros((numWords + 1, 100))
        for i, word in tokenizer.index_word.items():
            if i < numWords + 1:
                embedding_vector = embeddings_index.get(word)
                if embedding_vector is not None:
                    self.embedding_matrix[i] = embedding_vector
        return X, y

    def train(self, dataframe, batch_size, epochs):
        X, y = self.tokenizeAndEmbedding(dataframe)
        X_train, _, y_train, _, X_Val, y_Val = self.splitData(X, y)
        model_history = self.fitModel(X_train, y_train, X_Val, y_Val, batch_size, epochs)
        return model_history

    def fitModel(self, X_train, y_train, X_Val, y_Val, batch_size, epochs):
        input_layer = Input(shape=(maxlen,), dtype=tf.int64)
        embed = Embedding(numWords + 1, output_dim=100, input_length=maxlen, weights=[self.embedding_matrix], trainable=True)(input_layer)
        gru = GRU(128)(embed)
        drop = Dropout(0.3)(gru)
        dense = Dense(100, activation='relu')(drop)
        out = Dense(len(pd.Series(y_train).unique()), activation='softmax')(dense)
        self.model = Model(input_layer, out)
        self.model.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
        checkpoint = ModelCheckpoint('model-{epoch:03d}-{val_accuracy:03f}.h5', verbose=1, monitor='val_accuracy', save_best_only=True, mode='auto')
        reduceLoss = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=0.0001)
        model_history = self.model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, callbacks=[checkpoint, reduceLoss], validation_data=(X_Val, y_Val))
        return model_history, self.model

    def prediction(self):
        pred = self.model.predict(self.X_test)
        pred = [i.argmax() for i in pred]
        accuracy = metrics.accuracy_score(self.y_test, pred)
        print("Accuracy of the model :", accuracy)
        f1_score = metrics.f1_score(self.y_test, pred, average='weighted')
        print("F1 score of the model :", f1_score)
        return accuracy, f1_score

    def plotModelAccuracy(self, history, modelname):
        plt.plot(history.history['accuracy'])
        plt.plot(history.history['val_accuracy'])
        plt.title(modelname + ' model accuracy')
        plt.ylabel('accuracy')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title(modelname + ' model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()

    def plotModel(self):
        self.model.summary()
# Raw data + Word2Vec + GRU
gruModelRawData = GruWord2VecModel()
gruModelRawData_history, model = gruModelRawData.train(df_cleaned,100,epochs)
gruRaw_accuracy , gruRaw_f1 = gruModelRawData.prediction()
gruModelRawData.plotModelAccuracy(gruModelRawData_history, 'Raw data + Word2Vec + GRU')
Capture results in dataframe
results=captureData(results,gruModelRawData_history,'GRU model_Word2Vec_rawdata','GRU+Word2Vec Embedding on raw data','5',0)
pred_results= capturePrediction(pred_results,'GRU model_Word2Vec_rawdata','GRU+Word2Vec Embedding on raw data','5',gruRaw_accuracy,gruRaw_f1, 0)
pred_results
class RNNWord2VecModel:
X_test=[]
y_test=[]
embedding_matrix=[]
def wordTokenizer(self, dataframe):
tokenizer = Tokenizer(num_words=numWords,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',lower=True,split=' ', char_level=False)
tokenizer.fit_on_texts(dataframe)
dataframe = tokenizer.texts_to_sequences(dataframe)
return tokenizer,dataframe
def tokenizeAndEmbedding(self,dataframe):
tokenizer,X = self.wordTokenizer(dataframe['cleaned_description'])
y = np.asarray(dataframe['Assignment group'])
X = pad_sequences(X, maxlen = maxlen)
self.embedding_matrix = np.zeros((numWords+1, 100))
for i,word in tokenizer.index_word.items():
if i<numWords+1:
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
self.embedding_matrix[i] = embedding_vector
return X,y
def splitData(self,X,y):
print("Number of Samples:", len(X))
print("Number of Labels: ", len(y))
X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.2, random_state=10)
# draw the validation split from the training portion so it does not overlap the test set
X_train, X_Val, y_train, y_Val = train_test_split(X_train, y_train, test_size=0.2, random_state=10)
print("Number of train Samples:", len(X_train))
print("Number of val Samples:", len(X_Val))
return X_train, self.X_test, y_train, self.y_test, X_Val, y_Val
def train(self, dataframe, batch_size, epochs):
X,y = self.tokenizeAndEmbedding(dataframe)
X_train, _, y_train, _, X_Val, y_Val = self.splitData(X,y)
embed = Embedding(numWords+1,output_dim=100,input_length=maxlen,weights=[self.embedding_matrix], trainable=True)
model=Sequential()
model.add(Input(shape=(maxlen,),dtype=tf.int64))
model.add(embed)
model.add(Conv1D(100,10,activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.3))
model.add(Conv1D(100,10,activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(LSTM(128)))
model.add(Dropout(0.3))
model.add(Dense(100,activation='relu'))
model.add(Dense(len((pd.Series(y_train)).unique()),activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy',optimizer="adam",metrics=['accuracy'])
model.summary()
plot_model(model,to_file="RNN.jpg")
checkpoint = ModelCheckpoint('model-{epoch:03d}-{val_accuracy:03f}.h5', verbose=1, monitor='val_accuracy',save_best_only=True, mode='auto')
reduceLoss = ReduceLROnPlateau(monitor='val_loss', factor=0.2,patience=2, min_lr=0.0001)
model_history = model.fit(X_train,y_train,batch_size=batch_size, epochs=epochs, callbacks=[checkpoint,reduceLoss], validation_data=(X_Val, y_Val))
return model_history, model
def prediction(self,model):
pred = model.predict(self.X_test)
pred = [i.argmax() for i in pred]
accuracy = metrics.accuracy_score(self.y_test, pred)
print("Accuracy of the model :",accuracy)
f1_score = metrics.f1_score(self.y_test, pred,average='weighted')
print("F1 score of the model :",f1_score)
return accuracy,f1_score
def plotModelAccuracy(self, history, modelname):
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title(modelname+' model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title(modelname+' model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()
RNNModelRawData = RNNWord2VecModel()
RNNModelRawData_history, RnnModel = RNNModelRawData.train(df_cleaned,100,epochs)
RNNModelRawData_accuracy , RNNModelRawData_f1 = RNNModelRawData.prediction(RnnModel)
RNNModelRawData.plotModelAccuracy(RNNModelRawData_history, 'Raw data + Word2Vec + RNN')
results=captureData(results,RNNModelRawData_history,'RNN model_Word2Vec_rawdata','RNN+Word2Vec Embedding on raw data','6',0)
pred_results= capturePrediction(pred_results,'RNN model_Word2Vec_rawdata','RNN+Word2Vec Embedding on raw data','6',RNNModelRawData_accuracy,RNNModelRawData_f1,0)
pred_results
The LSTM and GRU models with GloVe embeddings performed best in terms of accuracy and F1 score.
Future Scope:
Next, models are built after resampling the target variable. Combinations of advanced NLP models with attention and transfer-learning approaches can also be explored. The models are compared on accuracy and F1 score.
Since Group-0 has far more samples than the other groups, resampling is done in two ways.
df_cleaned['Assignment group'].value_counts()
There is a high class imbalance in the target variable.
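As a quick sanity check on the imbalance, the ratio of the majority class count to the median class count can be computed. The snippet below uses hypothetical toy counts rather than `df_cleaned` itself:

```python
import pandas as pd

# Toy stand-in for df_cleaned['Assignment group'] (hypothetical counts)
groups = pd.Series([0] * 40 + [1] * 5 + [2] * 3 + [3] * 2)

counts = groups.value_counts()
imbalance_ratio = counts.max() / counts.median()  # majority vs. typical class size
print(counts.to_dict())   # {0: 40, 1: 5, 2: 3, 3: 2}
print(imbalance_ratio)    # 10.0
```

A ratio well above 1 indicates the majority class will dominate training unless the data is resampled.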
df_cleaned_others = df_cleaned[df_cleaned['Assignment group'] != 0]
df_cleaned_GRP0 = df_cleaned.copy()
df_cleaned_GRP0['Assignment group']=df_cleaned_GRP0['Assignment group'].apply(lambda x : 'other' if x != 0 else x)
maxValue_others = df_cleaned_others['Assignment group'].value_counts().max()
maxValue_others
from sklearn.utils import resample
df_cleaned_others_resampled = df_cleaned_others[0:0]
for grp in df_cleaned_others['Assignment group'].unique():
itTicketGrpDF = df_cleaned_others[df_cleaned_others['Assignment group'] == grp]
resampled = resample(itTicketGrpDF, replace=True, n_samples=int(maxValue_others/2), random_state=55)
df_cleaned_others_resampled = pd.concat([df_cleaned_others_resampled, resampled])  # DataFrame.append was removed in pandas 2.0
otherGrps_resampled = pd.concat([df_cleaned_GRP0,df_cleaned_others_resampled])
otherGrps_resampled.reset_index(inplace=True)
This creates one dataframe in which every group other than Group-0 is resampled to the mid value of 331 samples.
df_cleaned_resampled = df_cleaned[0:0]
for grp in df_cleaned['Assignment group'].unique():
itTicketGrpDF = df_cleaned[df_cleaned['Assignment group'] == grp]
resampled = resample(itTicketGrpDF, replace=True, n_samples=int(maxValue_others), random_state=55)
df_cleaned_resampled = pd.concat([df_cleaned_resampled, resampled])  # DataFrame.append was removed in pandas 2.0
This resamples the complete dataset to the maximum count among the non-Group-0 groups, downsampling Group-0 and upsampling the remaining groups.
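The resampling loop above can be illustrated on a toy dataframe (hypothetical data, not the actual tickets): resampling every group to one common size downsamples the dominant group and upsamples the rare ones:

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: group 0 dominates, group 2 is rare (hypothetical data)
df = pd.DataFrame({'Assignment group': [0] * 10 + [1] * 4 + [2] * 2,
                   'cleaned_description': ['txt'] * 16})

target_n = 6  # common size: downsamples group 0, upsamples groups 1 and 2
parts = []
for grp in df['Assignment group'].unique():
    sub = df[df['Assignment group'] == grp]
    parts.append(resample(sub, replace=True, n_samples=target_n, random_state=55))
balanced = pd.concat(parts, ignore_index=True)

print(balanced['Assignment group'].value_counts().to_dict())  # {0: 6, 1: 6, 2: 6}
```

Sampling with `replace=True` is what allows the rare groups to grow beyond their original size.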
def plot_groups(df):
descending_order = df['Assignment group'].value_counts().sort_values(ascending=False).index
plt.subplots(figsize=(22,5))
ax=sns.countplot(x='Assignment group', data=df, order=descending_order)  # order the bars by group frequency
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
plt.tight_layout()
plt.show()
plot_groups(df_cleaned_GRP0)
plot_groups(df_cleaned_others_resampled)
plot_groups(df_cleaned_resampled)
Label encoding and getting groups for resampled data
otherGrps_resampled['Assignment group'] = otherGrps_resampled['Assignment group'].astype('str')
otherGrps_resampled['Assignment group'] , grp_mapping_others_resampled= labelencoder(otherGrps_resampled['Assignment group'])
df_cleaned_resampled['Assignment group'] , grp_mapping_all_resampled= labelencoder(df_cleaned_resampled['Assignment group'])
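`labelencoder` is a helper defined earlier in the notebook; a minimal equivalent built on sklearn's `LabelEncoder`, returning the encoded series and a code-to-label mapping, might look like the sketch below (the exact return format of the original helper is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def labelencoder_sketch(series):
    """Encode labels to integers and return (encoded series, code -> label mapping)."""
    le = LabelEncoder()
    encoded = le.fit_transform(series)          # classes are sorted, then numbered 0..n-1
    mapping = dict(enumerate(le.classes_))      # reverse lookup for reporting predictions
    return pd.Series(encoded, index=series.index), mapping

s = pd.Series(['GRP_0', 'other', 'GRP_0'])
encoded, mapping = labelencoder_sketch(s)
print(list(encoded))  # [0, 1, 0]
print(mapping)        # {0: 'GRP_0', 1: 'other'}
```

Keeping the mapping around is what lets the predicted integer classes be translated back to assignment-group names.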
LSTM + Word2Vec for the two-model approach
class two_model_approach:
model_1 = Model()
model_2 = Model()
X_test=[]
y_test=[]
embedding_matrix=[]
def wordTokenizer(self, dataframe):
tokenizer = Tokenizer(num_words=numWords,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',lower=True,split=' ', char_level=False)
tokenizer.fit_on_texts(dataframe)
dataframe = tokenizer.texts_to_sequences(dataframe)
self.embedding_matrix = np.zeros((numWords+1, 100))
for i,word in tokenizer.index_word.items():
if i<numWords+1:
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
self.embedding_matrix[i] = embedding_vector
return tokenizer,dataframe
def splitData(self,X,y):
print("Number of Samples:", len(X))
print("Number of Labels: ", len(y))
X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.2, random_state=10)
# draw the validation split from the training portion so it does not overlap the test set
X_train, X_Val, y_train, y_Val = train_test_split(X_train, y_train, test_size=0.2, random_state=10)
print("Number of train Samples:", len(X_train))
print("Number of val Samples:", len(X_Val))
return X_train, self.X_test, y_train, self.y_test, X_Val, y_Val
def runFirstModel(self,dataframe,epochs):
grp0_df = dataframe.copy()
# redefine the target as 0 for Group_0 and 1 for other groups
grp0_df['Assignment group']=dataframe['Assignment group'].apply(lambda x : 1 if x != 0 else x)
tokenizer,X = self.wordTokenizer(grp0_df['cleaned_description'])
y = np.asarray(grp0_df['Assignment group'])
X = pad_sequences(X, maxlen = maxlen)
X_train, _, y_train, _, X_Val, y_Val = self.splitData(X,y)
model_history,self.model_1 = self.modelRunner(X_train,y_train,X_Val,y_Val,epochs)
return model_history,self.model_1
def runSecondModel(self, dataframe,epochs):
grpOthers_df = dataframe.copy()
# removing the group 0 from target groups
grpOthers_df = grpOthers_df[grpOthers_df['Assignment group'] != 0]
# shift labels down by one so the second model's classes start at 0 (group 0 was removed)
grpOthers_df['Assignment group']=grpOthers_df['Assignment group'] - 1
tokenizer,X = self.wordTokenizer(grpOthers_df['cleaned_description'])
y = np.asarray(grpOthers_df['Assignment group'])
X = pad_sequences(X, maxlen = maxlen)
X_train, _, y_train, _, X_Val, y_Val = self.splitData(X,y)
model_history,self.model_2 = self.modelRunner(X_train,y_train,X_Val,y_Val,epochs)
return model_history,self.model_2
def modelRunner(self, X,Y,X_Val,Y_Val,epochs):
input_layer = Input(shape=(maxlen,),dtype=tf.int64)
embed = Embedding(input_dim = numWords+1,output_dim=100,input_length=maxlen,weights=[self.embedding_matrix], trainable=True)(input_layer)
lstm=Bidirectional(LSTM(128))(embed)
drop=Dropout(0.3)(lstm)
dense =Dense(100,activation='relu')(drop)
out=Dense(len((pd.Series(Y)).unique()),activation='softmax')(dense)
batch_size = 100
model = Model(input_layer,out)
model.compile(loss='sparse_categorical_crossentropy',optimizer="adam",metrics=['accuracy'])
checkpoint = ModelCheckpoint('model-{epoch:03d}-{val_accuracy:03f}.h5', verbose=1, monitor='val_accuracy',save_best_only=True, mode='auto')
reduceLoss = ReduceLROnPlateau(monitor='val_loss', factor=0.2,patience=2, min_lr=0.0001)
model_history = model.fit(X,Y,batch_size=batch_size, epochs=epochs, callbacks=[checkpoint,reduceLoss], validation_data=(X_Val,Y_Val))
return model_history,model
def predict(self, X_test):
predBinary = self.model_1.predict(X_test)
predBinary = [1 if j>i else 0 for i,j in predBinary]  # 1 when the softmax puts more mass on 'other' than on group 0
new_X_test = pd.DataFrame(X_test)
new_X_test['grp']=predBinary
sec_input = new_X_test[new_X_test['grp']!=0].copy()  # .copy() avoids SettingWithCopyWarning on the drop below
sec_input.drop(['grp'],inplace=True, axis=1)
new_X_test=new_X_test[new_X_test['grp']==0]
predOther = self.model_2.predict(sec_input)
predOther = [i.argmax() for i in predOther]
predOther= [i+1 for i in predOther]
sec_input['grp']=predOther
pred_df = pd.concat([new_X_test,sec_input])
pred_df.sort_index(axis=0,inplace=True)
return np.array(pred_df['grp'])
def prediction(self,model):
pred = model.predict(self.X_test)  # the two-stage predict already returns class labels, so no argmax is needed
accuracy = metrics.accuracy_score(self.y_test, pred)
print("Accuracy of the model :",accuracy)
f1_score = metrics.f1_score(self.y_test, pred,average='weighted')
print("F1 score of the model :",f1_score)
return accuracy,f1_score
def plotModelAccuracy(self, history, modelname):
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title(modelname+' model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title(modelname+' model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()
model = two_model_approach()
model1_history,model1 = model.runFirstModel(otherGrps_resampled,10)
model2_history,model2 = model.runSecondModel(otherGrps_resampled,15)
model.plotModelAccuracy(model1_history, 'GRP0 vs Other')
model.plotModelAccuracy(model2_history, 'Other')
two_part_model_accuracy , two_part_model_f1 = model.prediction(model)
resampled_results = pd.DataFrame()
resampled_pred_results = pd.DataFrame()
resampled_results=captureData(resampled_results,model1_history,'Two part model-LSTM_W2V_grp0','LSTM+Word2Vec Embedding on grp0_data','1',0)
resampled_results=captureData(resampled_results,model2_history,'Two part model-LSTM_W2V_Others','LSTM+Word2Vec Embedding on Rest of groups','1',0)
resampled_pred_results= capturePrediction(resampled_pred_results,'Two part model-LSTM_W2V','Two part model + word2vec + LSTM bidirectional','1',two_part_model_accuracy , two_part_model_f1,0)
resampled_pred_results
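The index-based merge inside `predict` can be traced with a toy example (all values hypothetical): the stage-1 gate flags some rows as "not group 0", the stage-2 labels for those rows are shifted past group 0, and `sort_index` restores the original row order before returning:

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical): the gate says rows 1 and 3 are "not group 0"
X_test = pd.DataFrame(np.zeros((4, 3)))
gate = [0, 1, 0, 1]            # stage-1 binary decision per row
second_stage_labels = [2, 5]   # stage-2 classes for the gated rows (0-based)

routed = X_test.copy()
routed['grp'] = gate
others = routed[routed['grp'] != 0].copy()
grp0 = routed[routed['grp'] == 0]
others['grp'] = [lbl + 1 for lbl in second_stage_labels]  # shift past group 0
merged = pd.concat([grp0, others]).sort_index()           # back to original row order
print(list(merged['grp']))  # [0, 3, 0, 6]
```

Because both partitions keep their original index, concatenating and sorting by index reassembles one prediction per test row.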
LSTM + GloVe embedding for the two-model approach
class two_model_approach:
model_1 = Model()
model_2 = Model()
X_test=[]
y_test=[]
embedding_matrix=[]
def wordTokenizer(self, dataframe):
tokenizer = Tokenizer(num_words=numWords,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',lower=True,split=' ', char_level=False)
tokenizer.fit_on_texts(dataframe)
dataframe = tokenizer.texts_to_sequences(dataframe)
self.embedding_matrix = np.zeros((numWords+1, 100))
for i,word in tokenizer.index_word.items():
if i<numWords+1:
embedding_vector = embeddings_glove.get(word)
if embedding_vector is not None:
self.embedding_matrix[i] = embedding_vector
return tokenizer,dataframe
def splitData(self,X,y):
print("Number of Samples:", len(X))
print("Number of Labels: ", len(y))
X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.2, random_state=10)
# draw the validation split from the training portion so it does not overlap the test set
X_train, X_Val, y_train, y_Val = train_test_split(X_train, y_train, test_size=0.2, random_state=10)
print("Number of train Samples:", len(X_train))
print("Number of val Samples:", len(X_Val))
return X_train, self.X_test, y_train, self.y_test, X_Val, y_Val
def runFirstModel(self,dataframe,epochs):
grp0_df = dataframe.copy()
# redefine the target as 0 for Group_0 and 1 for other groups
grp0_df['Assignment group']=dataframe['Assignment group'].apply(lambda x : 1 if x != 0 else x)
tokenizer,X = self.wordTokenizer(grp0_df['cleaned_description'])
y = np.asarray(grp0_df['Assignment group'])
X = pad_sequences(X, maxlen = maxlen)
X_train, _, y_train, _, X_Val, y_Val = self.splitData(X,y)
model_history,self.model_1 = self.modelRunner(X_train,y_train,X_Val,y_Val,epochs)
return model_history,self.model_1
def runSecondModel(self, dataframe,epochs):
grpOthers_df = dataframe.copy()
# removing the group 0 from target groups
grpOthers_df = grpOthers_df[grpOthers_df['Assignment group'] != 0]
# shift labels down by one so the second model's classes start at 0 (group 0 was removed)
grpOthers_df['Assignment group']=grpOthers_df['Assignment group'] - 1
tokenizer,X = self.wordTokenizer(grpOthers_df['cleaned_description'])
y = np.asarray(grpOthers_df['Assignment group'])
X = pad_sequences(X, maxlen = maxlen)
X_train, _, y_train, _, X_Val, y_Val = self.splitData(X,y)
model_history,self.model_2 = self.modelRunner(X_train,y_train,X_Val,y_Val,epochs)
return model_history,self.model_2
def modelRunner(self, X,Y,X_Val,Y_Val,epochs):
input_layer = Input(shape=(maxlen,),dtype=tf.int64)
embed = Embedding(input_dim = numWords+1,output_dim=100,input_length=maxlen,weights=[self.embedding_matrix], trainable=True)(input_layer)
lstm=Bidirectional(LSTM(128))(embed)
drop=Dropout(0.3)(lstm)
dense =Dense(100,activation='relu')(drop)
out=Dense(len((pd.Series(Y)).unique()),activation='softmax')(dense)
batch_size = 100
model = Model(input_layer,out)
model.compile(loss='sparse_categorical_crossentropy',optimizer="adam",metrics=['accuracy'])
checkpoint = ModelCheckpoint('model-{epoch:03d}-{val_accuracy:03f}.h5', verbose=1, monitor='val_accuracy',save_best_only=True, mode='auto')
reduceLoss = ReduceLROnPlateau(monitor='val_loss', factor=0.2,patience=2, min_lr=0.0001)
model_history = model.fit(X,Y,batch_size=batch_size, epochs=epochs, callbacks=[checkpoint,reduceLoss], validation_data=(X_Val,Y_Val))
return model_history,model
def predict(self, X_test):
predBinary = self.model_1.predict(X_test)
predBinary = [1 if j>i else 0 for i,j in predBinary]  # 1 when the softmax puts more mass on 'other' than on group 0
new_X_test = pd.DataFrame(X_test)
new_X_test['grp']=predBinary
sec_input = new_X_test[new_X_test['grp']!=0].copy()  # .copy() avoids SettingWithCopyWarning on the drop below
sec_input.drop(['grp'],inplace=True, axis=1)
new_X_test=new_X_test[new_X_test['grp']==0]
predOther = self.model_2.predict(sec_input)
predOther = [i.argmax() for i in predOther]
predOther= [i+1 for i in predOther]
sec_input['grp']=predOther
pred_df = pd.concat([new_X_test,sec_input])
pred_df.sort_index(axis=0,inplace=True)
return np.array(pred_df['grp'])
def prediction(self,model):
pred = model.predict(self.X_test)  # the two-stage predict already returns class labels, so no argmax is needed
accuracy = metrics.accuracy_score(self.y_test, pred)
print("Accuracy of the model :",accuracy)
f1_score = metrics.f1_score(self.y_test, pred,average='weighted')
print("F1 score of the model :",f1_score)
return accuracy,f1_score
def plotModelAccuracy(self, history, modelname):
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title(modelname+' model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title(modelname+' model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()
model_2 = two_model_approach()
model1_history,model1 = model_2.runFirstModel(otherGrps_resampled,10)
model2_history,model2 = model_2.runSecondModel(otherGrps_resampled,15)
model_2.plotModelAccuracy(model1_history, 'GRP0 vs Other')
model_2.plotModelAccuracy(model2_history, 'Other')
two_part_model_accuracy , two_part_model_f1 = model_2.prediction(model_2)
resampled_results=captureData(resampled_results,model1_history,'Two part model-LSTM_glove_grp0','LSTM+glove Embedding on grp0_data','2',0)
resampled_results=captureData(resampled_results,model2_history,'Two part model-LSTM_glove_Others','LSTM+glove Embedding on Rest of groups','2',0)
resampled_pred_results= capturePrediction(resampled_pred_results,'Two part model-LSTM_glove','Two part model + glove + LSTM bidirectional','2',two_part_model_accuracy , two_part_model_f1,0)
resampled_pred_results
Since the two-part model does not give good results with either embedding, we do not experiment with other model and embedding combinations for this approach.
LSTM+Glove+Resampled data
lstmModel_resampled = LstmGloveModel()
lstmModel_resampled_history, model = lstmModel_resampled.train(df_cleaned_resampled,100,epochs)
lstm_resampled_accuracy , lstm_resampled_F1 = lstmModel_resampled.prediction()
lstmModel_resampled.plotModelAccuracy(lstmModel_resampled_history, 'Resampled data + Glove + LSTM bidirectional')
resampled_results=captureData(resampled_results,lstmModel_resampled_history,'Resampled data + Glove + LSTM bidirectional','Resampled data + Glove + LSTM bidirectional','3',0)
resampled_pred_results= capturePrediction(resampled_pred_results,'Resampled data + Glove + LSTM bidirectional','Resampled data + Glove + LSTM bidirectional','3',lstm_resampled_accuracy,lstm_resampled_F1,0)
resampled_pred_results
GRU+Glove+Resampled data
gruModel_resampled = GruGloveModel()
gruModel_resampled_history, model = gruModel_resampled.train(df_cleaned_resampled,100,epochs)
gruModel_resampled_accuracy, gruModel_resampled_f1 = gruModel_resampled.prediction()
gruModel_resampled.plotModelAccuracy(gruModel_resampled_history, 'Resampled data + Glove + GRU')
resampled_results=captureData(resampled_results,gruModel_resampled_history,'Resampled data + Glove + GRU','Resampled data + Glove +GRU','4',0)
resampled_pred_results= capturePrediction(resampled_pred_results,'Resampled data + Glove + GRU','Resampled data + Glove + GRU','4',gruModel_resampled_accuracy,gruModel_resampled_f1,0)
resampled_pred_results